Contrastive Approach towards Text Source Classification based on Top-Bag-of-Word Similarity

Authors

  • Chu-Ren Huang
  • Lung-Hao Lee
Abstract

This paper proposes a method to automatically classify texts from different varieties of the same language. We show that a similarity measure is a robust tool for studying comparable corpora of language variations. As the comparable corpora, we take LDC’s Chinese Gigaword Corpus, which is composed of three varieties of Chinese from Mainland China, Singapore, and Taiwan. Top-bag-of-word similarity measures reflect the distances among the three varieties of the same language. A contrastive approach based on top-bag-of-word similarity was taken to solve the text source classification problem. Our results show that a contrastive approach, which uses similarity to rule out identity of source and arrives at the actual source by inference, is more robust than direct confirmation of source by similarity. We show that this approach is robust when applied to other texts.
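The abstract does not give the exact definition of the similarity measure or the elimination rule, so the following is only a minimal sketch under stated assumptions: the "top bag of words" is taken to be the set of the n most frequent tokens, similarity is taken to be Jaccard overlap between two such sets, and a source is ruled out when its similarity falls below a threshold. The names `top_bag`, `overlap_similarity`, `classify_by_elimination`, and the `threshold` parameter are hypothetical and are not taken from the paper.

```python
from collections import Counter


def top_bag(tokens, n):
    """Return the set of the n most frequent tokens.

    Ties are broken by first occurrence, which is how
    Counter.most_common orders equal counts.
    """
    return {word for word, _ in Counter(tokens).most_common(n)}


def overlap_similarity(bag_a, bag_b):
    """Jaccard overlap between two top-word sets (an assumed measure)."""
    union = bag_a | bag_b
    if not union:
        return 0.0
    return len(bag_a & bag_b) / len(union)


def classify_by_elimination(text_tokens, source_bags, n, threshold):
    """Rule out every source whose similarity is below threshold.

    Returns (label, scores). The label is the single source that was
    not ruled out, or None when zero or several sources remain.
    This elimination rule is an assumption, not the paper's procedure.
    """
    text_bag = top_bag(text_tokens, n)
    scores = {
        name: overlap_similarity(text_bag, bag)
        for name, bag in source_bags.items()
    }
    remaining = [name for name, score in scores.items() if score >= threshold]
    label = remaining[0] if len(remaining) == 1 else None
    return label, scores


# Toy data with placeholder tokens; real input would be segmented Chinese text.
source_bags = {
    "A": {"a", "b", "c"},
    "B": {"a", "x", "y"},
    "C": {"p", "q", "r"},
}
tokens = ["a", "b", "c", "a", "b", "d"]

label, scores = classify_by_elimination(tokens, source_bags, n=3, threshold=0.5)
ambiguous_label, _ = classify_by_elimination(tokens, source_bags, n=3, threshold=0.1)
```

With a strict threshold only source "A" survives elimination in this toy example; with a loose threshold two sources remain and the sketch declines to assign a label rather than guess.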


Similar resources

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...


CSCR010: Second Year Report

The aim of my PhD research is focused on Text Mining, one major research school in Knowledge Discovery in Databases (KDD), and in particular Text Preprocessing (TPP) for classification / categorization of documents utilizing novel algorithms for the identification of hidden patterns, rules, regularities and trends within these documents. Significant techniques in Data Mining, another well-known ...


Machine learning of syntactic parse trees for search and classification of text

We build an open-source toolkit which implements deterministic learning to support search and text classification tasks. We extend the mechanism of logical generalization towards syntactic parse trees and attempt to detect weak semantic signals from them. Generalization of syntactic parse tree as a syntactic similarity measure is defined as the set of maximum common subtrees and performed at a ...


Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...





Publication date: 2008